1 Summary

Process:

  • 1st) Simulate a population dataset based on the questions from the recruitment form. It represents a rough guess at the total population living at 30-60% of area median income (AMI) in Boulder City. This fake dataset is based on initial estimates and/or guesses of the demographic parameters (including what the parameters should be) and serves only for illustration.

  • 2nd) Randomly sample 4000 applicants from the simulated population data.

  • 3rd) Select the first ‘wave’ of 200 program selections using two methods: (A) a custom weighting procedure and (B) a purely random sample of 200 selections from the applicant pool.

  • 4th) Select the second and third waves using propensity score matching against the applicant pool (Ho et al. 2011; see note 1).

  • 5th) Make list of additional backups to use for additional verification if needed. Define a process for selecting these additional backup selections, based on prioritizing the least represented groups.

  • Last) Make the dataset with selections and backups available for download (see Section 6, Appendix A).

2 Assumptions

  • There will be enough recruits into the program that we can have multiple waves of selections within the weighting criteria we define.

  • Failures of verification will be ~randomly distributed across groups.

  • For the sake of the simulations and calculations here (which are just for an abstract presentation of the process), assume there will be 4000 applicants, 200 selections, and 200 backups in each of three sampling waves. We are also assuming that the applicant pool is a random selection from the population (which probably won’t be the case in our intended application).

  • For the purposes of weighting, assume groups are independent. That is, we have estimates for the proportion of the population by racial category and we use these weights to make a random selection, likewise with gender, and disability, etc.

3 Requirements

  • Ideally, make all matches based on estimates of the population in Boulder City who are either a) between 30 and 60% of area median income (AMI) or b) below the poverty line. Option a is preferable; b is the backup if we encounter data limitations.

  • Proportionate match by race/ethnicity, gender identity, and disability status.

  • Individuals with children under 18 should be represented in the program at about twice their estimated representation in the population.
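
As a minimal sketch of the doubling requirement, assuming it means multiplying the estimated population share by two and assigning the remainder to households without children, the target proportions could be derived as shown here. The function name `doubled_target` and the input share of 0.20 are hypothetical; 0.20 is used only for illustration and is not an estimate stated in this document.

```python
def doubled_target(pop_share: float) -> dict[str, float]:
    """Target proportions when the 'Yes' group is weighted at twice its population share."""
    yes = 2 * pop_share
    if not 0 <= yes <= 1:
        raise ValueError("the doubled share must stay within [0, 1]")
    return {"Yes": yes, "No": 1 - yes}


# Hypothetical input: an illustrative population share, not a source estimate.
targets = doubled_target(0.20)
print(targets)
```

The guard matters in practice: the rule only works while the estimated share is at most one half.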

4 Questionnaire info

*These options will change once we have closed in on the data for population estimates.*

The eligibility questionnaire will have questions on each of the above, plus additional eligibility and other characteristics not addressed here.

Ethnicity/race options:

  • Non-Latino White (e.g., German, Irish, English, Italian)

  • Hispanic, Latinx, or Spanish origin (e.g., Mexican/Mexican American, Puerto Rican, Cuban, Dominican, Salvadoran, Colombian)

  • Black or African American (e.g., African American, Jamaican, Haitian, Nigerian, Ethiopian, Somalian)

  • Asian (e.g., Chinese, Filipino, Asian Indian, Vietnamese, Korean, Japanese)

  • American Indian or Alaska Native (e.g., Navajo Nation, Blackfeet Tribe, Muscogee (Creek) Nation, Mayan, Doyon, Native Village of Barrow Inupiat Traditional Government)

  • Native Hawaiian or Other Pacific Islander (e.g., Native Hawaiian, Samoan, Guamanian or Chamorro, Tongan, Fijian, Marshallese)

  • Middle Eastern or North African (e.g., Lebanese, Egyptian)

  • Not Listed (please specify)

Gender:

  • Woman

  • Man

  • Transgender

  • Non-binary/Gender non-conforming

  • Prefer to self identify (please write in your preferred identity here)

Households with children under 18

  • Calculated from a general question on household composition, which includes relationship and birthday questions; these are in turn used to determine whether the household has children under 18.

  • Assume this is a binary variable: 1 = household with children under 18, 0 = otherwise.
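
A minimal sketch of how the binary flag might be derived from the relationship and birthday answers. The field names (`relationship`, `birthday`), the set of relationship labels, and the reference date are all assumptions made for illustration; the actual questionnaire options are not specified here.

```python
from datetime import date

# Hypothetical relationship labels; the real questionnaire options may differ.
CHILD_RELATIONSHIPS = {"child", "stepchild", "foster child", "grandchild"}


def age_on(birthday: date, as_of: date) -> int:
    """Age in whole years on the date as_of."""
    had_birthday = (as_of.month, as_of.day) >= (birthday.month, birthday.day)
    return as_of.year - birthday.year - (0 if had_birthday else 1)


def child_household(members: list[dict], as_of: date) -> int:
    """Return 1 if any listed child is under 18 on as_of, else 0."""
    return int(any(
        m["relationship"] in CHILD_RELATIONSHIPS and age_on(m["birthday"], as_of) < 18
        for m in members
    ))


household = [
    {"relationship": "self", "birthday": date(1985, 3, 2)},
    {"relationship": "child", "birthday": date(2010, 6, 15)},
]
flag = child_household(household, as_of=date(2024, 1, 1))
print(flag)
```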

Disability status:

  • placeholders for now. We have three types of disability simply called disability 1, 2, and 3.

5 Estimates

This table shows the probabilities that we are working with in the current iteration of our fake data. These are a combination of empirical estimates and rough guesses (for now).

Table 5.1: Parameters for weighting
sub_group                                    target_props
race_ethnicity
  White (not latino)                         0.756
  Hispanic                                   0.100
  Black or African American                  0.014
  Asian                                      0.051
  American Indian or Alaska Native           0.002
  Native Hawaiian or Other Pacific Islander  0.001
  Middle Eastern or North African            0.038
  Not Listed                                 0.038
gender
  Woman                                      0.398
  Man                                        0.502
  Transgender                                0.030
  Non-binary/Gender non-conforming           0.030
  Prefer to self identify                    0.040
child_household
  No                                         0.600
  Yes                                        0.400
disability
  None                                       0.850
  Disability1                                0.050
  Disability2                                0.050
  Disability3                                0.050

This table shows the sums across sub-groups as an initial internal check. They should generally sum to 1. The values for child household have already been manipulated to ensure twice as many households with children are included.

Table 5.2: Proportions for each group (should = 1; a simple comprehension check)
group            group_sum
child_household  1
disability       1
gender           1
race_ethnicity   1
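
The internal check can be reproduced with a few lines. This is a sketch using the values from Table 5.1; the names `TARGET_PROPS` and `group_sums` are illustrative and not taken from the original analysis code.

```python
import math

# Target proportions copied from Table 5.1.
TARGET_PROPS = {
    "race_ethnicity": {
        "White (not latino)": 0.756, "Hispanic": 0.100,
        "Black or African American": 0.014, "Asian": 0.051,
        "American Indian or Alaska Native": 0.002,
        "Native Hawaiian or Other Pacific Islander": 0.001,
        "Middle Eastern or North African": 0.038, "Not Listed": 0.038,
    },
    "gender": {
        "Woman": 0.398, "Man": 0.502, "Transgender": 0.030,
        "Non-binary/Gender non-conforming": 0.030,
        "Prefer to self identify": 0.040,
    },
    "child_household": {"No": 0.600, "Yes": 0.400},
    "disability": {"None": 0.850, "Disability1": 0.050,
                   "Disability2": 0.050, "Disability3": 0.050},
}

# Sum the proportions within each group; each should equal 1 up to float error.
group_sums = {group: sum(levels.values()) for group, levels in TARGET_PROPS.items()}
for group, total in sorted(group_sums.items()):
    status = "ok" if math.isclose(total, 1.0, abs_tol=1e-9) else "CHECK"
    print(f"{group:16s} {total:.3f} {status}")
```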

5.1 Sim data

5.1.1 Population

Fake data for an arbitrary notion of the ‘total population’. This means all the people in Boulder living between 30 and 60% AMI. Right now this is 25000 people.

A few example rows from the simulated population sample:

Table 5.3: Sample rows from our fake data
id     race_ethnicity      gender                            child_household  disability
18190  White (not latino)  Woman                             No               None
18374  White (not latino)  Woman                             Yes              None
1018   White (not latino)  Woman                             Yes              Disability2
3145   White (not latino)  Woman                             No               None
23489  White (not latino)  Man                               Yes              None
8901   Asian               Non-binary/Gender non-conforming  Yes              Disability3
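
One way such a population could be simulated, assuming (as stated in the Assumptions section) that the groups are independent, is to draw each attribute separately with the target proportions as weights. This is a sketch rather than the code used to produce the table above; `simulate_population` and the seed are hypothetical.

```python
import random

# Target proportions copied from Table 5.1.
TARGET_PROPS = {
    "race_ethnicity": {
        "White (not latino)": 0.756, "Hispanic": 0.100,
        "Black or African American": 0.014, "Asian": 0.051,
        "American Indian or Alaska Native": 0.002,
        "Native Hawaiian or Other Pacific Islander": 0.001,
        "Middle Eastern or North African": 0.038, "Not Listed": 0.038,
    },
    "gender": {
        "Woman": 0.398, "Man": 0.502, "Transgender": 0.030,
        "Non-binary/Gender non-conforming": 0.030,
        "Prefer to self identify": 0.040,
    },
    "child_household": {"No": 0.600, "Yes": 0.400},
    "disability": {"None": 0.850, "Disability1": 0.050,
                   "Disability2": 0.050, "Disability3": 0.050},
}


def simulate_population(n: int, props: dict, seed: int = 1) -> list[dict]:
    """Draw n people, sampling every attribute independently of the others."""
    rng = random.Random(seed)
    columns = {
        group: rng.choices(list(levels), weights=list(levels.values()), k=n)
        for group, levels in props.items()
    }
    return [
        {"id": i + 1, **{group: columns[group][i] for group in props}}
        for i in range(n)
    ]


population = simulate_population(25_000, TARGET_PROPS)
print(population[0])
```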

5.1.2 Enrollees

Randomly select 4000 from the population.

Table 5.4: Proportions in randomly selected enrollee data
sub_group                                    count  proportions  target_proportions
child_household
  No                                         2433   0.608        0.600
  Yes                                        1567   0.392        0.400
disability
  Disability1                                213    0.053        0.050
  Disability2                                213    0.053        0.050
  Disability3                                196    0.049        0.050
  None                                       3378   0.845        0.850
gender
  Man                                        1995   0.499        0.502
  Non-binary/Gender non-conforming           133    0.033        0.030
  Prefer to self identify                    180    0.045        0.040
  Transgender                                123    0.031        0.030
  Woman                                      1569   0.392        0.398
race_ethnicity
  American Indian or Alaska Native           9      0.002        0.002
  Asian                                      198    0.050        0.051
  Black or African American                  57     0.014        0.014
  Hispanic                                   411    0.103        0.100
  Middle Eastern or North African            137    0.034        0.038
  Native Hawaiian or Other Pacific Islander  5      0.001        0.001
  Not Listed                                 158    0.040        0.038
  White (not latino)                         3025   0.756        0.756

Note: as a reminder/clarifier, in the above table the ‘proportions’ column is what we observe when we select 4000 rows/individuals from our simulated population data. The target_proportions are the values used to simulate the population data. These values will generally be very similar because when you sample a large-ish population at random you will mostly tend to maintain the proportions of its characteristic parts. No weighting is applied at this step because we assume that those who apply to the program are something like a random sample of all those who could apply (the ‘population’).
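
The point of the note can be illustrated with a small sketch for a single attribute: draw an unweighted sample of 4000 from a simulated population of 25000 and compare observed and target proportions. The seed and variable names are arbitrary choices for this illustration, so the printed counts will not reproduce the table above.

```python
import random
from collections import Counter

# Gender target proportions copied from Table 5.1.
GENDER_PROPS = {
    "Woman": 0.398, "Man": 0.502, "Transgender": 0.030,
    "Non-binary/Gender non-conforming": 0.030,
    "Prefer to self identify": 0.040,
}

rng = random.Random(42)  # arbitrary seed, for reproducibility only

# Simulated population of 25000, then an unweighted sample without replacement.
population = rng.choices(list(GENDER_PROPS), weights=list(GENDER_PROPS.values()), k=25_000)
applicants = rng.sample(population, k=4_000)

counts = Counter(applicants)
for level, target in GENDER_PROPS.items():
    observed = counts[level] / len(applicants)
    print(f"{level:34s} {counts[level]:5d} {observed:.3f} {target:.3f}")
```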

5.1.3 Select sample 1

To select the first sample wave of 200 individuals from our 4000 applicant pool we first take a weighted sample of the data using the target proportions in Table 5.4.

The weighting procedure:

  • Calculate the expected number of individuals in a sample of 200 if they were in the sample at exactly their expected proportions.

    • For any cases where the expected number of people is less than one person, round up to one person. This seems like a small effect, but consider that for a rare characteristic we might expect 0.2 people to have that characteristic in a sample of 200. By rounding up to one, we have increased the odds that someone with this characteristic gets selected fivefold.

  • For any characteristic with an expected count <= 3, add three to its expected count. This is another way of increasing the proportionate representation of rare characteristics.

  • Take a random sample of 25% of the target sample size of 200 and reserve this for individuals with rare characteristics. These are defined by examining the applicant pool and simply counting the characteristics of all the people in the pool. This sample of 50 (25% of 200), drawn from the rarest 50% of characteristics within each group, is reserved for inclusion in the final selected sample.

  • The remaining 75% are chosen by a simple weighting from the enrollee pool.

  • Lastly, if any characteristics are present in the enrollee pool but still missing from the selected sample, select one person at random with that characteristic and replace someone chosen at random with the most common set of characteristics.
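
The first two rules of the procedure can be sketched as follows. This is one possible reading, not the original implementation: it assumes expected counts are rounded to the nearest integer, that the floor of one is applied before the "add three" rule, and that the rule is checked against the rounded count. The function name `adjusted_counts` and its parameters are hypothetical.

```python
# Race/ethnicity target proportions copied from Table 5.1.
RACE_PROPS = {
    "Native Hawaiian or Other Pacific Islander": 0.001,
    "American Indian or Alaska Native": 0.002,
    "Black or African American": 0.014,
    "Middle Eastern or North African": 0.038,
    "Not Listed": 0.038,
    "Asian": 0.051,
    "Hispanic": 0.100,
    "White (not latino)": 0.756,
}


def adjusted_counts(props: dict, n: int = 200, floor: int = 1,
                    rare_cutoff: int = 3, boost: int = 3) -> dict:
    """Expected counts in a sample of n, boosted for rare characteristics."""
    out = {}
    for level, p in props.items():
        count = max(floor, round(p * n))  # assumption: round, then floor at one person
        if count <= rare_cutoff:          # assumption: cutoff checked after rounding
            count += boost
        out[level] = count
    return out


counts = adjusted_counts(RACE_PROPS)
for level, count in counts.items():
    print(f"{level:42s} {count}")
print("total:", sum(counts.values()))
```

Under these assumptions the adjusted counts can sum to more than the target sample size, because the boosts are added without reducing any other group.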

The target proportions in Table 5.4 are based on characteristics of participants, so this first step in the sampling selects more than 200. We then select 200 people for the first sampling wave using the procedure just described.
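
The final replacement step of the weighting procedure (swapping in anyone whose characteristic is present in the pool but absent from the selection) might look like the sketch below. The helper names `profile` and `fill_missing` are hypothetical, and the sketch does not guard against the edge case where the person swapped out was the only holder of some other characteristic.

```python
import random
from collections import Counter

ATTRS = ("race_ethnicity", "gender", "child_household", "disability")


def profile(person: dict) -> tuple:
    """The person's full set of characteristics as a hashable tuple."""
    return tuple(person[a] for a in ATTRS)


def fill_missing(selected: list[dict], pool: list[dict], rng: random.Random) -> list[dict]:
    """For each characteristic in pool but not in selected, swap one holder in."""
    selected = list(selected)  # work on a copy
    for attr in ATTRS:
        for level in sorted({p[attr] for p in pool}):
            if any(p[attr] == level for p in selected):
                continue
            # Nobody selected has this level, so any pool member with it is new.
            newcomer = rng.choice([p for p in pool if p[attr] == level])
            # Replace a random person holding the most common set of characteristics.
            top_profile = Counter(profile(p) for p in selected).most_common(1)[0][0]
            holders = [i for i, p in enumerate(selected) if profile(p) == top_profile]
            selected[rng.choice(holders)] = newcomer
    return selected


# Tiny illustrative pool: five people share one profile, one person differs.
common = {"race_ethnicity": "White (not latino)", "gender": "Man",
          "child_household": "No", "disability": "None"}
pool = [{"id": i, **common} for i in range(1, 6)]
pool.append({"id": 6, **common, "race_ethnicity": "Asian"})
result = fill_missing(pool[:4], pool, random.Random(0))
print(sorted(p["id"] for p in result))
```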

Table 5.5: Comparing two possible weightings
sub_group                                    props  target_counts  count_rand  proportions_rand  count_w  proportions_w
race_ethnicity
  Native Hawaiian or Other Pacific Islander  0.001  1              NA          NA                2        0.010
  American Indian or Alaska Native           0.002  1              NA          NA                3        0.015
  Black or African American                  0.014  3              1           0.005             4        0.020
  Middle Eastern or North African            0.038  8              8           0.040             14       0.070
  Not Listed                                 0.038  8              3           0.015             12       0.060
  Asian                                      0.051  10             13          0.065             13       0.065
  Hispanic                                   0.100  20             18          0.090             18       0.090
  White (not latino)                         0.756  151            157         0.785             134      0.670
gender
  Transgender                                0.030  6              5           0.025             11       0.055
  Non-binary/Gender non-conforming           0.030  6              5           0.025             11       0.055
  Prefer to self identify                    0.040  8              7           0.035             16       0.080
  Woman                                      0.398  80             87          0.435             72       0.360
  Man                                        0.502  100            96          0.480             90       0.450
disability
  Disability2                                0.050  10             8           0.040             9        0.045
  Disability3                                0.050  10             13          0.065             18       0.090
  Disability1                                0.050  10             11          0.055             11       0.055
  None                                       0.850  170            168         0.840             162      0.810
child_household
  Yes                                        0.400  80             72          0.360             77       0.385
  No                                         0.600  120            128         0.640             123      0.615

5.1.4 Select samples 2 and 3

The second wave selection works by taking the wave 1 selection and then using an algorithm to find each individual's closest match among the 3800 individuals remaining in the applicant pool. This is done using a technique called propensity score matching (Ho et al. 2011).

The third wave of sampled individuals is selected with the same process.
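
To make the idea of "closest match" concrete, here is a deliberately simplified stand-in. It is not propensity score matching: instead of estimating a propensity score, it greedily pairs each person from the previous wave with the remaining applicant who differs on the fewest attributes, without replacement. The names `mismatches` and `match_wave` and the example records are hypothetical.

```python
import random

ATTRS = ("race_ethnicity", "gender", "child_household", "disability")


def mismatches(a: dict, b: dict) -> int:
    """Number of attributes on which two people differ."""
    return sum(a[k] != b[k] for k in ATTRS)


def match_wave(previous_wave: list[dict], pool: list[dict], rng: random.Random):
    """Greedy 1:1 matching without replacement; pool must be at least as large as the wave."""
    remaining = list(pool)
    rng.shuffle(remaining)  # ties are broken at random
    matched = []
    for person in previous_wave:
        best = min(remaining, key=lambda c: mismatches(person, c))
        remaining.remove(best)
        matched.append(best)
    return matched, remaining


base = {"race_ethnicity": "Hispanic", "gender": "Woman",
        "child_household": "Yes", "disability": "None"}
wave1 = [{"id": 1, **base}, {"id": 2, **base, "gender": "Man"}]
pool = [
    {"id": 3, **base, "gender": "Man"},
    {"id": 4, **base},
    {"id": 5, **base, "disability": "Disability1"},
    {"id": 6, **base, "child_household": "No"},
]
wave2, leftover = match_wave(wave1, pool, random.Random(7))
print([p["id"] for p in wave2])
```

The third wave would be produced by calling `match_wave` again with the leftover pool. Which wave it should be matched against is a design choice this sketch leaves open.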

5.2 Viz the waves

First, let's compare the population data to the applicant data:

Table 5.6: Results across three sampling waves
sub_group                                    target_props  target_counts  count_w1  props_w1  count_w2  props_w2  count_w3  props_w3
race_ethnicity
  Native Hawaiian or Other Pacific Islander  0.001         1              2         0.010     2         0.010     1         0.005
  American Indian or Alaska Native           0.002         1              3         0.015     3         0.015     3         0.015
  Black or African American                  0.014         3              4         0.020     4         0.020     6         0.030
  Middle Eastern or North African            0.038         8              14        0.070     12        0.060     11        0.055
  Not Listed                                 0.038         8              12        0.060     13        0.065     15        0.075
  Asian                                      0.051         10             13        0.065     13        0.065     10        0.050
  Hispanic                                   0.100         20             18        0.090     18        0.090     18        0.090
  White (not latino)                         0.756         151            134       0.670     135       0.675     136       0.680
gender
  Transgender                                0.030         6              11        0.055     14        0.070     14        0.070
  Non-binary/Gender non-conforming           0.030         6              11        0.055     9         0.045     9         0.045
  Prefer to self identify                    0.040         8              16        0.080     17        0.085     14        0.070
  Woman                                      0.398         80             72        0.360     71        0.355     70        0.350
  Man                                        0.502         100            90        0.450     89        0.445     93        0.465
disability
  Disability2                                0.050         10             9         0.045     7         0.035     6         0.030
  Disability3                                0.050         10             18        0.090     16        0.080     20        0.100
  Disability1                                0.050         10             11        0.055     12        0.060     9         0.045
  None                                       0.850         170            162       0.810     165       0.825     165       0.825
child_household
  Yes                                        0.400         80             77        0.385     75        0.375     78        0.390
  No                                         0.600         120            123       0.615     125       0.625     122       0.610

Figure 5.1: Proportions by race group in simulated population data.

Figure 5.2: Proportions by gender in simulated population data.

We can examine just the race and gender breakdowns, above, to see that randomly sampling 4000 individuals from our population of 25000 leads to proportions in each group that are fairly similar to those in the population.

Next, we can see how the proportions in each sampling wave compare to the ‘target’ proportions in the population data:

Figure 5.3: Proportions by racial grouping, sampling waves.

Figure 5.4: Proportions by gender, sampling waves.

Figure 5.5: Proportions of households with a child in the home, by sampling wave.

Figure 5.6: Proportions by disability status, sampling wave.

6 Appendix A: Example datasets

The first example dataset presents a column for each sampling wave. The intended use is that all the individuals in the far left column, Wave 1, are selected to the program for verification. If some of these individuals cannot be verified, their replacement is the cell in the same row immediately to the right, in the Wave 2 column. If someone in Wave 2 cannot be verified, then proceed to Wave 3.
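
The row-wise fallback can be expressed as a small lookup. This is a sketch with made-up IDs; `active_selection` is a hypothetical helper, and each row is assumed to hold the Wave 1, Wave 2, and Wave 3 IDs in that order.

```python
def active_selection(row: tuple, failed: set):
    """Return (wave_number, person_id) for the first person in the row who has not
    failed verification, or None if the whole row is exhausted."""
    for wave, person_id in enumerate(row, start=1):
        if person_id not in failed:
            return wave, person_id
    return None


# Hypothetical IDs: each tuple is one row of the wide dataset (Wave 1, Wave 2, Wave 3).
rows = [(101, 201, 301), (102, 202, 302)]
failed = {101, 201}  # people who could not be verified

current = [active_selection(row, failed) for row in rows]
print(current)
```

A `None` result signals that all three waves are exhausted for that row, which is where the additional backup list would come into play.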

Table 6.1: Suggested format for the ‘simple’ version of the sample waves using the example data generated above.
Table 6.2: Suggested format for the extended, wide, version of the sample waves using the example data generated above showing all attributes for each matched sample wave.

References

Ho, Daniel E., Kosuke Imai, Gary King, and Elizabeth A. Stuart. 2011. “MatchIt: Nonparametric Preprocessing for Parametric Causal Inference.” Journal of Statistical Software 42 (8). https://doi.org/10.18637/jss.v042.i08.

  1. Propensity score matching is a technique often used in quasi-experimental designs for statistically matching members of a treatment group to members of a control group. In our case, we use the same kind of algorithm to match each participant in sampling waves 2 and 3 with their most similar counterpart in the applicant pool.